This project explores housing data from Kaggle: Housing Price Competition.
Housing data as downloaded from Kaggle has quite a few missing values. For this project, all columns with missing values have been dropped from the data set. Also, there were 62 variables in the original dataset. The final dataset used for the project has 1456 observations and 30 variables.
## 'data.frame': 1456 obs. of 30 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
Here is a brief description of the variables that have been explored in the project:
Since the distribution of sale price has a slight positive skew, it is log transformed to bring it closer to a normal distribution.
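As a minimal sketch of this transformation, the snippet below applies `log10` to the ten SalePrice values visible in the `str()` output above and compares a simple moment-based skewness before and after. The names `sale_price`, `log_price` and `skewness` are illustrative assumptions, not taken from the original analysis.

```r
# First ten SalePrice values, copied from the str() output above
sale_price <- c(208500, 181500, 223500, 140000, 250000,
                143000, 307000, 200000, 129900, 118000)

# Log transform (base 10, matching the log10() used in the models later on)
log_price <- log10(sale_price)

# Simple moment-based skewness (hypothetical helper, not a library function)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(sale_price)
skew_log <- skewness(log_price)

# Positive skew shrinks after the log transform on this small sample
print(c(raw = skew_raw, log = skew_log))
```

On the full data the same comparison would use `df$SalePrice` in place of the toy vector.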
These are various conditions of sale as per documentation:
While most of the sales were under normal conditions, sales of new houses that were partially complete fetched higher prices.
These are various types of sale as per documentation:
Most of the sales were conventional warranty deeds or homes that were just constructed and sold, with new homes sold at higher prices.
The type of dwelling is defined in the data by two variables:
BldgType, which is a general description of a building,
and MSSubClass, which identifies the type of dwelling involved in the sale in more detail:
Single-family detached buildings are the most common type, and most dwellings are single or double storeyed and built in 1946 or later.
Density plot of sale price for all five building types
Houses with seemingly more privacy, like Single-family Detached houses and Townhouse End Units, fetched higher prices than the others. Also, although duplex houses in general weren’t as expensive, newer two-storeyed houses were sold at higher prices.
MSZoning: Identifies the general zoning classification of the sale.
* A Agriculture
* C Commercial
* FV Floating Village Residential
* I Industrial
* RH Residential High Density
* RL Residential Low Density
* RP Residential Low Density Park
* RM Residential Medium Density
Most of the houses are from residential low and medium density zones, and houses from residential low density zones were pricier.
OverallQual: Rates the overall material and finish of the house
Most of the houses in the dataset are of average or better quality at the time of sale. And as expected, good quality houses have higher sale prices.
The age of a building as perceived at the time of sale is taken as the difference between the year of sale and the year of remodelling, which is the same as the construction year if there was no remodelling or additions.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4 14 23 41 60
Most of the properties in the dataset are 20 years old or less. It will be interesting to explore the sale price of a building with respect to its age.
Although there is a variation in total sales for each month of a year, distribution of sale price doesn’t vary much across the year.
Let us see if there is any preference for some specific neighbourhoods.
Neighborhood: Physical locations within Ames city limits
There does seem to be a distinct price preference for some neighbourhoods, like Northridge, Northridge Heights and Stone Brook. This preference will be explored later in the bivariate and multivariate plots sections.
Since the distribution of lot size is highly skewed, a log transformation of LotArea is applied to normalise the distribution.
LotShape: General shape of property
LandContour: Flatness of the property
LandSlope: Slope of property
Most of the properties are regular shaped or slightly irregular, with a gentle slope and level contour. Interestingly, moderately irregular plots were sold at higher prices.
Utilities: Type of utilities available
Most of the properties have all the utilities available.
LotConfig: Lot configuration as defined in the dataset
Most of the lots were either inside or corner lots. Also, houses in cul-de-sacs and houses with frontage on three sides are pricier.
Street: Type of road access to property
Most of the houses have paved street access and, as expected, houses with paved access fetch higher prices than ones with gravel access.
Ground living area distribution is normalised by log transformation before plotting.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1128 1458 1507 1775 3627
Garage area distribution is normalised by log transformation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 329.5 478.5 471.6 576.0 1390.0
Basement area distribution is normalised by log transformation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.0 990.5 1050.7 1293.8 3206.0
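GarageArea and TotalBsmtSF both have a minimum of 0 in the summaries above, and `log10(0)` is `-Inf`. The report does not say how these zeros were handled; one common option, shown here purely as an assumption, is to add 1 before taking the log. The vector `garage_area` is toy data.

```r
# Toy garage areas, including a house with no garage (area 0)
garage_area <- c(548, 460, 0, 642, 836)

# A plain log transform produces -Inf for the zero entry
raw_log <- log10(garage_area)

# Assumed workaround: shift by 1 so that an area of 0 maps to 0
log_garage <- log10(garage_area + 1)

print(raw_log)
print(log_garage)
```

Dropping the zero-area rows before plotting is another option; which one the original plots used is not stated.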
More rooms could mean bigger houses hence higher prices.
There are different types of roof defined for houses:
Most of the houses had either gabled or hipped roofs, with some of these houses having higher ground living area. Looking at the sale price distribution, some of the houses with these roofs did fetch higher prices too. A combined plot of sale price with ground living area can reveal if higher area did correspond to higher price, which will be explored in the bivariate plots section.
KitchenQual: Kitchen quality
While most of the houses had an average or good kitchen, a kitchen in excellent condition did influence the sale price of the house.
While the bathroom to bedroom ratio does seem to influence the sale price to some extent, a very high number of bathrooms per bedroom doesn’t seem to matter much.
The tidy housing dataset used for the project has 1456 observations and 31 variables.
The main features of interest in the dataset are:
Some of the features which might help with investigation of the main features are:
Apart from these, others like number of bathrooms, lot shape and configuration also seem to have some influence on the sale price. Also, it would be important to explore if any of these features are correlated.
The following new variables were created from existing variables in the dataset:
BldgAge: Age of the building
A new variable for the age of a property at the time of its sale was created. If the structure was remodelled at a later date after its construction, then the age was taken as YrSold - YearRemodAdd, else as YrSold - YearBuilt.
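The rule for BldgAge can be sketched as follows. This is a minimal illustration, assuming a data frame `df` with the YearBuilt, YearRemodAdd and YrSold columns; the three rows are taken from the `str()` output above (houses 1, 2 and 4).

```r
# Three houses from the str() output: two never remodelled, one remodelled in 1970
df <- data.frame(
  YearBuilt    = c(2003, 1976, 1915),
  YearRemodAdd = c(2003, 1976, 1970),
  YrSold       = c(2008, 2007, 2006)
)

# Age counts from the remodel year when the house was remodelled after
# construction, otherwise from the construction year
df$BldgAge <- ifelse(df$YearRemodAdd > df$YearBuilt,
                     df$YrSold - df$YearRemodAdd,
                     df$YrSold - df$YearBuilt)

print(df)
```

Because YearRemodAdd equals YearBuilt when there was no remodelling, `df$YrSold - df$YearRemodAdd` alone gives the same result.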
bathPerRoom: Number of bathrooms per bedroom
Houses have full baths and half baths as per the dataset. The total number of bathrooms in a house is taken as FullBath + 0.5 * HalfBath, and the ratio bathPerRoom is calculated by dividing the number of bathrooms by the number of bedrooms, BedroomAbvGr.
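A minimal sketch of the bathPerRoom calculation is given below, on toy rows. The intermediate column `TotalBath` is a hypothetical name, and treating houses with zero bedrooms as `NA` is an assumption made here to avoid division by zero; the report does not say how such rows were handled.

```r
# Toy rows; the last house has no above-ground bedrooms
df <- data.frame(
  FullBath     = c(2, 2, 1, 1),
  HalfBath     = c(1, 0, 0, 1),
  BedroomAbvGr = c(3, 3, 3, 0)
)

# A half bath counts as 0.5 of a bathroom
df$TotalBath <- df$FullBath + 0.5 * df$HalfBath

# Ratio of bathrooms to bedrooms; NA where there are no bedrooms (assumption)
df$bathPerRoom <- ifelse(df$BedroomAbvGr > 0,
                         df$TotalBath / df$BedroomAbvGr,
                         NA_real_)

print(df)
```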
The following features had skewed distributions, which have been normalised using log transformation:
Reducing skew is useful before applying many statistical analyses, as a number of common parametric methods, including the linear regression used later, tend to behave better when the variables are approximately normally distributed.
While, as expected, OverallQual is very highly correlated with SalePrice, some features like GrLivArea, GarageArea and TotalBsmtSF also have good correlation with the target variable. Others like YearRemodAdd and YearBuilt can also be good predictors of the SalePrice of the property. Also, there are some features which are correlated with each other, such as YearRemodAdd with BldgAge and GrLivArea with TotRmsAbvGrd.
There is a correlation of 0.4 between the sale price of the house and its lot area.
There is a strong correlation of 0.73 between the sale price of the house and its living area.
There is a correlation of 0.46 between the sale price of the house and its garage area.
There is a correlation of 0.37 between the sale price of the house and its basement area.
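Correlations like the ones above can be computed with `cor()`. The sketch below uses only the first ten SalePrice and GrLivArea values from the `str()` output, so its result will not match the full-data figures quoted in the text; applying `log10` to both variables is an assumption based on the transformations described earlier.

```r
# First ten rows of two columns, copied from the str() output
sale_price <- c(208500, 181500, 223500, 140000, 250000,
                143000, 307000, 200000, 129900, 118000)
gr_liv_area <- c(1710, 1262, 1786, 1717, 2198,
                 1362, 1694, 2090, 1774, 1077)

# Pearson correlation on the log-transformed values
r <- cor(log10(gr_liv_area), log10(sale_price))

print(round(r, 2))
```

On the full data, `cor(log10(df$GrLivArea), log10(df$SalePrice))` would be the equivalent call.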
There does seem to be a positive correlation between the overall quality of the property and the year it was built, with more recent properties having better overall quality than older structures. There were a few older houses which were in good condition.
A higher percentage of houses in neighbourhoods like Northridge, Northridge Heights, Somerset, Stone Brook and Bloomington Heights were of better quality. Interestingly though, as noticed in the univariate analysis, Northridge, Northridge Heights and Stone Brook were pricier than others, in spite of good quality houses in other neighbourhoods.
Let us consider all buildings older than 20 years as ‘Old’.
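A minimal sketch of this split, assuming a BldgAge column as defined earlier. The column name `ageGroup` and the toy values are illustrative, not taken from the original code.

```r
# Toy data: building age at sale and total rooms above ground
df <- data.frame(
  BldgAge      = c(5, 31, 36, 20, 0),
  TotRmsAbvGrd = c(8, 6, 7, 5, 8)
)

# Buildings strictly older than 20 years are labelled "Old"
df$ageGroup <- ifelse(df$BldgAge > 20, "Old", "New")

# Cross-tabulate age group against number of rooms
rooms_by_age <- table(df$ageGroup, df$TotRmsAbvGrd)
print(rooms_by_age)
```

A `table()` call of this shape produces a New/Old cross-tabulation against any discrete variable.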
Northridge, Northridge Heights, Somerset and Bloomington Heights had all new houses, while more than 75% of the houses in College Creek, Gilbert and Stone Brook were new.
It was observed earlier that some of the neighbourhoods had higher priced houses than others. Maybe some of the neighbourhoods have old houses and people don’t prefer them? Let us divide the neighbourhoods into three groups: affluent (Level 1) with average sale price above 250,000, not so affluent (Level 2) with average sale price between 150,000 and 250,000, and not affluent (Level 3) with average sale price below 150,000, and look at the age of the buildings, at the time of sale, in each group.
| Neighborhood | affluence |
|---|---|
| Blmngtn | Level 2 |
| Blueste | Level 3 |
| BrDale | Level 3 |
| BrkSide | Level 3 |
| ClearCr | Level 2 |
| CollgCr | Level 2 |
| Crawfor | Level 2 |
| Edwards | Level 3 |
| Gilbert | Level 2 |
| IDOTRR | Level 3 |
| MeadowV | Level 3 |
| Mitchel | Level 2 |
| NAmes | Level 3 |
| NoRidge | Level 1 |
| NPkVill | Level 3 |
| NridgHt | Level 1 |
| NWAmes | Level 2 |
| OldTown | Level 3 |
| Sawyer | Level 3 |
| SawyerW | Level 2 |
| Somerst | Level 2 |
| StoneBr | Level 1 |
| SWISU | Level 3 |
| Timber | Level 2 |
| Veenker | Level 2 |
## df$affluence: Level 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 154000 254000 302000 314627 367294 625000
## --------------------------------------------------------
## df$affluence: Level 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 76000 165000 190000 200649 230000 424870
## --------------------------------------------------------
## df$affluence: Level 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 110000 130000 132525 149000 475000
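The grouping of neighbourhoods into affluence levels can be sketched as below. This is an illustration on toy prices, not the original code: the helper names `nbhd_mean` and `level` are assumptions, and the handling of a mean exactly equal to 150,000 or 250,000 (assigned to the lower level here) is a choice the text does not specify.

```r
# Toy sales for three neighbourhoods
df <- data.frame(
  Neighborhood = c("NoRidge", "NoRidge", "CollgCr", "CollgCr", "OldTown", "OldTown"),
  SalePrice    = c(300000, 340000, 190000, 210000, 120000, 130000)
)

# Mean sale price per neighbourhood
nbhd_mean <- tapply(df$SalePrice, df$Neighborhood, mean)

# Bin the means: below 150,000 is Level 3, up to 250,000 is Level 2, above is Level 1
level <- cut(as.numeric(nbhd_mean),
             breaks = c(-Inf, 150000, 250000, Inf),
             labels = c("Level 3", "Level 2", "Level 1"))

# Map each sale back to its neighbourhood's level
df$affluence <- level[match(df$Neighborhood, names(nbhd_mean))]

# Per-level price summaries, in the same form as the output above
print(by(df$SalePrice, df$affluence, summary))
```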
Most of the older houses were sold below 200K US$, with a marked dip in sale price for houses older than 20 years. As expected, more affluent neighbourhoods have newer houses than the others, although some older houses also seemed to fetch a good price. The correlation between sale price and the age of the building is -0.52.
We can look at the sale price distribution of the new and old buildings separately, with buildings above 20 years of age marked as “Old”.
Correlation of sale price with building age for new buildings is -0.18, while for older buildings it is -0.4.
Let us check if proximity to certain amenities affects neighbourhood affluence.
Condition1: Proximity to various conditions
Here observations are grouped by neighbourhood and the mean sale price for each proximity condition is plotted.
## df$Condition1: Artery
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 66500 105000 119550 135092 143000 475000
## --------------------------------------------------------
## df$Condition1: Feedr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40000 120825 139500 142256 167750 244600
## --------------------------------------------------------
## df$Condition1: Norm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 131500 165800 183596 219428 625000
## --------------------------------------------------------
## df$Condition1: PosA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 180000 188750 212500 225875 244000 335000
## --------------------------------------------------------
## df$Condition1: PosN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 109500 166125 206000 216875 257375 385000
## --------------------------------------------------------
## df$Condition1: RRAe
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 87000 127750 142500 138400 156500 171000
## --------------------------------------------------------
## df$Condition1: RRAn
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 79500 152394 171495 184397 190105 423000
## --------------------------------------------------------
## df$Condition1: RRNe
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187000 188875 190750 190750 192625 194500
## --------------------------------------------------------
## df$Condition1: RRNn
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 110000 128000 214000 212400 290000 320000
While most of the affluent neighbourhoods had normal proximity to various conditions and people might be willing to pay for the houses there for various other factors, proximity to positive off-site features like parks and greenbelts did matter to a certain extent.
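The grouping of sale price by neighbourhood and proximity condition can be sketched with `aggregate()`. The rows below are toy data, and the original analysis may have used a different grouping function; `mean_price` is an illustrative name.

```r
# Toy sales with neighbourhood and proximity condition
df <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "NAmes", "Crawfor", "Crawfor"),
  Condition1   = c("Norm", "Feedr", "Norm", "PosN", "Norm"),
  SalePrice    = c(150000, 130000, 160000, 230000, 200000)
)

# Mean sale price for every neighbourhood / condition combination present
mean_price <- aggregate(SalePrice ~ Neighborhood + Condition1,
                        data = df, FUN = mean)

print(mean_price)
```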
##
## 2 3 4 5 6 7 8 9 10 11 12 14
## New 0 9 37 118 199 213 125 58 28 13 4 1
## Old 1 8 60 157 203 116 62 17 17 4 6 0
Many of the bigger houses of all ages did have more rooms, with a correlation of 0.54.
Looks like bigger houses were in better condition too.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5000 0.6250 0.6409 0.8333 2.0000
Correlation between the total number of bathrooms and the living area of a house is 0.71, which is quite high.
Correlation between the total number of bathrooms and the age of a house is -0.45, which is moderate and negative.
Two main features of the dataset, SalePrice and OverallQual are highly correlated (correlation = 0.8) with each other. Some of the features have high to moderate correlation with SalePrice, like
As expected, there was a positive correlation between the overall quality of the property and the year it was built, with more recent properties having better overall quality than older structures. Not surprisingly, newer houses were more expensive.
While most of the houses had an average or good kitchen, a kitchen in excellent condition did influence the sale price of the house. It would be worthwhile to investigate this in terms of older houses fetching a good price.
Quite a few expensive properties were near a positive off-site feature such as a park or greenbelt. People paid a high premium for such features.
Although old houses were mostly cheaper, some of them did fetch good prices. Some of these houses were in the premium neighbourhoods. Also, older houses had fewer bathrooms and, as expected, bigger houses had more bathrooms.
It was very interesting to observe that bigger houses were also in better condition, and hence fetched better prices too.
There was a preference for some neighbourhoods, like Northridge, Northridge Heights and Stone Brook. These neighbourhoods had a higher percentage of better quality houses and were expensive. Importantly, while these neighbourhoods with all new houses in good condition had pricier houses, a few other neighbourhoods with new houses in good condition didn’t get good prices. Looks like these are high end neighbourhoods!
There was a very strong relation between SalePrice and overall quality of the house, which seems quite logical. Also, houses with higher ground living area, meaning bigger houses, had higher sale price. Other features like age of the building and quality of kitchen also had good bearing on the sale price of the house.
Kitchen quality is converted to numeric here as follows:
* Ex = 3
* Gd = 2
* TA = 1
* Fa = 0
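One way to apply this mapping is a named lookup vector, sketched below on toy values. The names `kitchen_levels` and `KitchenQualNum` are assumptions for illustration; the original code may have recoded the column differently.

```r
# Lookup from quality code to numeric score, as listed above
kitchen_levels <- c(Fa = 0, TA = 1, Gd = 2, Ex = 3)

# Toy KitchenQual values
KitchenQual <- c("Gd", "TA", "Gd", "Ex", "Fa")

# Index the lookup by code; unname() drops the code labels from the result
KitchenQualNum <- unname(kitchen_levels[KitchenQual])

print(KitchenQualNum)
```

Treating the scores as evenly spaced is itself a modelling assumption, since the codes are ordinal.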
A building built or remodelled more than 20 years before its year of sale is taken as ‘Old’.
Some of the old houses that sold at higher prices did have a bigger living area.
Some of the old houses that sold at higher prices did have a better quality kitchen.
Old houses that sold at higher prices had bigger garages.
The linear regression models use log-transformed values of SalePrice and GrLivArea.
##
## Calls:
## m1: lm(formula = I(log10(SalePrice)) ~ OverallQual, data = df)
## m2: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)),
## data = df)
## m3: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)) +
## YearBuilt, data = df)
## m4: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)) +
## YearBuilt + Neighborhood, data = df)
## m5: lm(formula = I(log10(SalePrice)) ~ OverallQual + I(log10(GrLivArea)) +
## YearBuilt + Neighborhood + KitchenQual, data = df)
##
## =======================================================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4.595*** 3.340*** 0.478** 0.932*** 1.332***
## (0.012) (0.055) (0.172) (0.276) (0.271)
## OverallQual 0.103*** 0.074*** 0.053*** 0.046*** 0.040***
## (0.002) (0.002) (0.002) (0.002) (0.002)
## I(log10(GrLivArea)) 0.453*** 0.508*** 0.468*** 0.455***
## (0.019) (0.018) (0.017) (0.017)
## YearBuilt 0.001*** 0.001*** 0.001***
## (0.000) (0.000) (0.000)
## Neighborhood: Blueste/Blmngtn -0.057 -0.032
## (0.052) (0.050)
## Neighborhood: BrDale/Blmngtn -0.112*** -0.094***
## (0.025) (0.024)
## Neighborhood: BrkSide/Blmngtn 0.025 0.034
## (0.021) (0.021)
## Neighborhood: ClearCr/Blmngtn 0.099*** 0.102***
## (0.022) (0.021)
## Neighborhood: CollgCr/Blmngtn 0.032 0.035*
## (0.018) (0.017)
## Neighborhood: Crawfor/Blmngtn 0.101*** 0.105***
## (0.021) (0.020)
## Neighborhood: Edwards/Blmngtn -0.004 0.004
## (0.020) (0.019)
## Neighborhood: Gilbert/Blmngtn 0.006 0.018
## (0.019) (0.018)
## Neighborhood: IDOTRR/Blmngtn -0.052* -0.046*
## (0.023) (0.022)
## Neighborhood: MeadowV/Blmngtn -0.059* -0.054*
## (0.024) (0.024)
## Neighborhood: Mitchel/Blmngtn 0.028 0.043*
## (0.020) (0.019)
## Neighborhood: NAmes/Blmngtn 0.036 0.045*
## (0.018) (0.018)
## Neighborhood: NoRidge/Blmngtn 0.079*** 0.087***
## (0.020) (0.020)
## Neighborhood: NPkVill/Blmngtn -0.012 0.008
## (0.029) (0.028)
## Neighborhood: NridgHt/Blmngtn 0.089*** 0.082***
## (0.019) (0.018)
## Neighborhood: NWAmes/Blmngtn 0.026 0.040*
## (0.019) (0.019)
## Neighborhood: OldTown/Blmngtn -0.009 -0.008
## (0.021) (0.020)
## Neighborhood: Sawyer/Blmngtn 0.036 0.045*
## (0.020) (0.019)
## Neighborhood: SawyerW/Blmngtn 0.013 0.018
## (0.019) (0.019)
## Neighborhood: Somerst/Blmngtn 0.028 0.029
## (0.018) (0.018)
## Neighborhood: StoneBr/Blmngtn 0.095*** 0.091***
## (0.022) (0.021)
## Neighborhood: SWISU/Blmngtn 0.004 0.013
## (0.024) (0.023)
## Neighborhood: Timber/Blmngtn 0.063** 0.069***
## (0.020) (0.020)
## Neighborhood: Veenker/Blmngtn 0.113*** 0.115***
## (0.027) (0.026)
## KitchenQual 0.036***
## (0.004)
## -------------------------------------------------------------------------------------------------------
## R-squared 0.671 0.760 0.802 0.842 0.851
## adj. R-squared 0.671 0.760 0.801 0.839 0.848
## sigma 0.099 0.084 0.077 0.069 0.067
## F 2967.527 2305.971 1954.783 281.030 290.397
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 1306.993 1537.571 1674.635 1838.826 1881.858
## Deviance 14.158 10.314 8.544 6.819 6.428
## AIC -2607.987 -3067.142 -3339.271 -3619.651 -3703.716
## BIC -2592.136 -3046.009 -3312.854 -3466.431 -3545.212
## N 1456 1456 1456 1456 1456
## =======================================================================================================
The variables OverallQual, GrLivArea, YearBuilt, Neighborhood and KitchenQual together account for approximately 84.8% of the variance in the log of the sale price (adjusted R-squared of model m5).
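The nested models in the table can be built up one term at a time. The sketch below fits the first two on synthetic data generated here only so that the example runs on its own; the coefficients used to simulate the data are arbitrary assumptions, and the results will not match the table. The table layout above resembles the output of `mtable()` from the memisc package, but that is an inference rather than something the report states, so only base R is used.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the housing data (not the real dataset)
df <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = round(runif(n, 600, 3500))
)
df$SalePrice <- 10^(3.3 + 0.07 * df$OverallQual +
                    0.45 * log10(df$GrLivArea) + rnorm(n, sd = 0.08))

# m1: quality only; m2: adds log living area, mirroring the formulas in the table
m1 <- lm(I(log10(SalePrice)) ~ OverallQual, data = df)
m2 <- update(m1, ~ . + I(log10(GrLivArea)))

# Compare fit; R-squared cannot decrease when a term is added to a nested model
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))
```

Because the response is log10(SalePrice), a coefficient b multiplies the predicted price by 10^b per unit increase. For example, the m5 coefficient of 0.040 for OverallQual corresponds to a factor of 10^0.040, roughly 1.10, per quality point, with the other variables held fixed.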
The sizes of various features of the property, from the living area to the garage and basement, had a positive relation with the sale price. There also seemed to be a positive correlation among the sizes of the living area, garage and basement, which was reflected in the sale price. The year built was highly correlated with the sale price and the overall quality of the property.
It was intriguing to note that some of the old houses sold at good prices. These old houses that had higher sale prices were bigger houses with a good quality kitchen and a bigger garage.
The linear model created using the log of sale price with features like overall quality, log of ground living area, neighbourhood, kitchen quality and year built is able to account for approximately 84.8% of the variance in the log of the sale price.
This can definitely be improved by looking at other features like SaleCondition, SaleType and LotConfig in more depth. I think including the number of bathrooms could also help improve the model, after analysing its interaction with the living area and age of the building. Also, some of the features which were excluded from the project could help improve the model.
There is a very high correlation between the sale price of the houses and their overall quality. It makes perfect sense, since no buyer would like to pay a higher price for a property in bad shape.
This plot highlights the strong correlation of the two features, overall quality and living area, with the main feature, sale price. It is interesting to note the strong correlation between overall quality and living area too, with bigger houses being of better quality.
I think it was natural to find this association between higher prices and certain neighbourhoods. People paid a high premium to be in certain neighbourhoods even if the property was older.
The Ames Housing dataset used here contains sales within Ames from 2006 to 2010, as described by the author of the data. The dataset has too many variables, and for the purpose of this project I chose to keep only the variables that I felt a potential house buyer would look at before buying a house, like size of the property, number of rooms, bathrooms, overall condition of the property, location etc.
I also chose to retain sale condition and sale type as I felt they might have some effect on the sale price. Preliminary analysis of the data showed that while most of the sales were under normal conditions, sales of new houses that were partially complete fetched higher prices. Also, most of the sales were conventional warranty deeds or homes that were just constructed and sold, with new homes sold at higher prices.
Most of the houses in the dataset are of average or better quality at the time of sale. As I had expected, good quality houses have higher sale prices and the quality is better for newer houses. I went with my preliminary hunch that people would look at the overall size of a house, including the quality of the kitchen and number of bathrooms, plus maybe even the size of the garage. And indeed, while most of the houses had an average or good kitchen, a kitchen in excellent condition did influence the sale price of the house. People paid higher prices for a bigger garage area as well. Interestingly, bigger houses also were in better condition, thereby selling at higher prices.
My initial thought while looking at the Neighborhood and Condition1 data, which describes proximity to features like green belts and parks, was that people might prefer neighbourhoods that fulfilled more of these criteria. But most of the observations did not show such a preference, although pricey neighbourhoods did have these features.
Another feature that intrigued me was that some old houses fetched better prices in comparison. I looked at this aspect from different perspectives, starting from their condition at the time of the sale, the neighbourhood they belonged to, the size of the house and even the number of bathrooms. Finally, it turned out that apart from other things, these houses were indeed in better condition when sold.
A few features that I could have analysed further were number of bathrooms, lot configuration, sale condition and sale type. I could see there was some interaction between number of bathrooms and age of houses. Also, sale price was positively correlated with the number of bathrooms only up to a certain number. Abnormal sale condition also seemed to influence the sale price. Further, I noticed that plots with frontage open on three sides and the ones in cul-de-sacs had better sale prices. These aspects can be further investigated and maybe included in the model too.